Document Translation Retrieval Based on Statistical Machine Translation Techniques

Authors

  • Felipe Sánchez-Martínez
  • Rafael C. Carrasco
Abstract

We compare different strategies for applying statistical machine translation techniques to retrieve documents that are plausible translations of a given source document. Finding the translated version of a document is a relevant task, for example, when building a corpus of parallel texts that can help to create and evaluate new machine translation systems. In contrast to the traditional settings in cross-language information retrieval tasks, in this case both the source and the target text are long and, thus, the procedure used to select which words or phrases are included in the query has a key effect on retrieval performance. In the statistical approach explored here, both the probability of the translation and the relevance of the terms are taken into account in order to build an effective query.
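As a rough illustration of how translation probability and term relevance might be combined when building a query, the sketch below scores each candidate target-language term by its expected count under a word-translation table, multiplied by a smoothed inverse document frequency. This is only a minimal sketch under stated assumptions, not the authors' method: the function name `build_query`, the table layout, the IDF smoothing, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def build_query(source_tokens, translation_table, doc_freq, num_docs, k=5):
    """Pick the k target-language terms with the highest combined score.

    Assumed (hypothetical) inputs:
      translation_table: {source_word: {target_word: p(target | source)}}
      doc_freq:          {target_word: number of target documents containing it}
      num_docs:          size of the target-language collection
    """
    source_counts = Counter(source_tokens)

    # Expected number of occurrences of each target term in a translation
    # of the source document: sum over source words of count * p(t | s).
    scores = {}
    for s_word, count in source_counts.items():
        for t_word, prob in translation_table.get(s_word, {}).items():
            scores[t_word] = scores.get(t_word, 0.0) + count * prob

    # Weight by a smoothed IDF so that very common target terms
    # contribute little to the query (relevance of the term).
    for t_word in scores:
        idf = math.log((num_docs + 1) / (doc_freq.get(t_word, 0) + 1))
        scores[t_word] *= idf

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy data, invented purely for illustration.
tokens = ["casa", "casa", "de", "perro"]
table = {
    "casa": {"house": 0.8, "home": 0.2},
    "de": {"of": 0.9},
    "perro": {"dog": 1.0},
}
freqs = {"house": 10, "home": 50, "of": 99, "dog": 5}
query = build_query(tokens, table, freqs, num_docs=100, k=2)
```

With this toy data, frequent function words such as "of" receive a near-zero weight, so the query is dominated by terms that are both likely translations and distinctive in the target collection.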

Similar resources

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, with the hyphen separating the different parts. The Persian language has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead. This common incorrect use of space leads to some s...

ICT-Crossn: The System of Cross-lingual Information Retrieval of ICT in NTCIR-7

IR4QA is a new task in NTCIR-7, which intends to evaluate which IR techniques are more helpful to a QA system. This paper describes in detail the implementation of our IR4QA system, ICT-Crossn. The system consists of a query translation component that integrates the methods of phrase based statistical machine translation and OOV translation methods based on search engine, and a document retriev...

Should we Translate the Documents or the Queries in Cross-language Information Retrieval?

Previous comparisons of document and query translation suffered difficulty due to differing quality of machine translation in these two opposite directions. We avoid this difficulty by training identical statistical translation models for both translation directions using the same training data. We investigate information retrieval between English and French, incorporating both translations dir...

An Improvement in Cross-Language Document Retrieval Based on Statistical Models

This paper presents a method that integrates three statistical models (a translation model, a query generation model and a document retrieval model) for cross-language document retrieval. Given a document in the source language, it is translated into the target language by the statistical machine translation model. The query generation model then selects the most relevant wo...

Information Retrieval from Unstructured Web Text Document Based on Automatic Learning of the Threshold

Collocation is defined as a sequence of lexical tokens which habitually co-occur. This type of information is widely used in various applications such as Information Retrieval, document indexing, machine translation, lexicography, etc. Therefore, many techniques are developed for the automatic retrieval of collocations from textual documents. These techniques use statistical measures based on a...


Journal title:
  • Applied Artificial Intelligence

Volume 25

Publication date: 2011